Retaining relevant data from a [large] set of features
by projecting to a lower-dimensional feature subspace without losing important information,
depending on the problem to be solved.
PyData Cluj-Napoca meetup 3, 2019.05.07
- PCA is able to reconstruct a rotation of the source signal in the first 2 components
- lots of dimensions
- multiple time series with common underlying factors
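The source-recovery demo described above can be reproduced with a minimal scikit-learn sketch (the synthetic sources, mixing matrix, and noise level here are illustrative assumptions, not the meetup's demo data):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 500)
# two underlying source signals, shared by many observed channels
sources = np.column_stack([np.sin(2 * t), np.sign(np.cos(3 * t))])
mixing = rng.normal(size=(2, 20))                     # unknown linear mixing
observed = sources @ mixing + 0.01 * rng.normal(size=(500, 20))

# the first two components span (a rotation of) the source subspace
pca = PCA(n_components=2)
recovered = pca.fit_transform(observed)
print(pca.explained_variance_ratio_.sum())  # close to 1.0
```

PCA recovers the 2-D subspace but only up to rotation: the components are decorrelated, not necessarily aligned with the original sources.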
Why:
- exploratory data analysis
- visualizations
- extraction of underlying (or relevant) factors
- better data separability, informative clustering
- robust learning, less overfitting, improved performance of supervised and unsupervised models
- noise reduction
- anomaly detection
- data compression
- fewer computational resources needed
- why it works: successive data samples are related to one another
- low variance filter
- correlation based filtering
- univariate feature selection
- recursive feature elimination
- variable importance based filtering
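The filter and wrapper methods listed above all have off-the-shelf scikit-learn implementations; a minimal sketch on synthetic data (dataset sizes and thresholds are arbitrary choices for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import VarianceThreshold, SelectKBest, RFE, f_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)

# low variance filter: drop near-constant features
X_var = VarianceThreshold(threshold=0.1).fit_transform(X)

# univariate selection: keep the 5 features with the best ANOVA F-score
X_uni = SelectKBest(f_classif, k=5).fit_transform(X, y)

# recursive feature elimination: repeatedly refit a model, drop weakest feature
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)
X_rfe = X[:, rfe.support_]
```

Variable-importance filtering works the same way as RFE but uses a tree ensemble's `feature_importances_` instead of linear coefficients.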
- LDA also computes a decision boundary, it can be used as a classifier
- projections onto the first component have the highest class separation (LDA focuses on class separation, while PCA focuses on variance/correlation)
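Both uses of LDA mentioned above, supervised projection and classification, are available in one scikit-learn estimator; a minimal sketch on the iris dataset (my choice of example data):

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)   # supervised projection: axes maximize class separation
accuracy = lda.score(X, y)        # the same fitted model doubles as a classifier
```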
- suitable for non-linearly separable data
- fewer assumptions about the distributions than LDA: QDA estimates a separate mean and covariance matrix for each class
- the new axes define a lower dimensional subspace, the compressed dataset is built from the new coordinates of the projected points in this subspace
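The projection/compression mechanics described above map directly to `transform` and `inverse_transform` in scikit-learn; a small sketch on iris (example data is my choice):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2).fit(X)
X_compressed = pca.transform(X)                    # coordinates in the new subspace
X_restored = pca.inverse_transform(X_compressed)   # back to the original space
error = np.mean((X - X_restored) ** 2)             # small: little information lost
```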
- iterative algorithm using shrinkage methods (lasso, ridge, elastic net)
- adds a regularization term to the loss function, penalizing non-zero coefficients
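A sketch of L1-based selection via the penalty described above: non-zero coefficients survive, the rest are shrunk to exactly zero (the synthetic data and `alpha` value are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))
# only the first 3 of 30 features actually drive the target
y = X[:, 0] * 2 - X[:, 1] + 0.5 * X[:, 2] + 0.01 * rng.normal(size=200)

# the L1 penalty drives the coefficients of irrelevant features to zero
lasso = Lasso(alpha=0.05).fit(X, y)
selector = SelectFromModel(lasso, prefit=True)
kept = np.flatnonzero(selector.get_support())   # indices of surviving features
```

Ridge shrinks but does not zero coefficients; elastic net interpolates between the two penalties.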
- iterative algorithms that minimize mutual information or maximize non-Gaussianity (negentropy or kurtosis)
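A minimal blind-source-separation sketch with scikit-learn's FastICA (the two synthetic sources and the random mixing are my own illustrative setup):

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 1000)
# two statistically independent, non-Gaussian sources
sources = np.column_stack([np.sin(3 * t), np.sign(np.sin(5 * t))])
mixed = sources @ rng.normal(size=(2, 2)).T     # unknown linear mixing

ica = FastICA(n_components=2, random_state=0)
unmixed = ica.fit_transform(mixed)              # recovered sources, up to order and scale
```

Unlike PCA, ICA resolves the rotation ambiguity by exploiting non-Gaussianity, so the recovered components match the original sources (up to permutation and sign).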
- linear, polynomial, spline, radial basis function (RBF), ANOVA RBF etc.
- an RBF kernel projects into an infinite-dimensional space; it is far more complex than the 2D Gaussian function on the right
- kernel parameter tuning: various methods, usually grid search
- models incorporate external knowledge when applying a certain kernel or technique (including by the choice of the parameters)
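A sketch of the kernel idea above: concentric circles are not linearly separable, but kernel PCA with an RBF kernel makes them so (the dataset and `gamma` value are my illustrative choices, not tuned):

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import LogisticRegression

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# linear PCA cannot separate the two rings; an RBF kernel can
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10)
X_kpca = kpca.fit_transform(X)

# a linear classifier now separates the classes in the projected space
score = LogisticRegression().fit(X_kpca, y).score(X_kpca, y)
```

Choosing `kernel="rbf"` and its `gamma` here is exactly the "external knowledge" point above: the kernel choice encodes an assumption about the data's structure.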
- transforms the proximity metric into fitted distances in a lower-dimensional space, such that the disparities are matched as well as possible
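A minimal MDS sketch with a precomputed proximity matrix, matching the description above (random data stands in for a real dissimilarity matrix):

```python
import numpy as np
from sklearn.manifold import MDS
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
D = pairwise_distances(X)   # proximity matrix in the original space

# fit 2-D coordinates whose pairwise distances approximate the disparities
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
X_mds = mds.fit_transform(D)
```

`mds.stress_` reports how well the fitted distances reproduce the input disparities.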
- preserves topology
- uses "geodesic" distances: retains only some of the graph distances among objects (the smaller ones) and estimates all dissimilarities as shortest path distances
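An Isomap sketch on the classic swiss-roll dataset (my choice of example; `n_neighbors` is not tuned):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

X, _ = make_swiss_roll(n_samples=500, random_state=0)

# geodesic distances are estimated as shortest paths on the k-nearest-neighbor graph
iso = Isomap(n_neighbors=10, n_components=2)
X_iso = iso.fit_transform(X)   # the roll is "unrolled" into 2-D
```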
- "think globally, fit locally"
- computes nearest neighbors, finds a tangent space to each neighborhood and combines those to find an embedding that aligns the tangents
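The tangent-space alignment described above corresponds to the `ltsa` variant of scikit-learn's locally linear embedding; a sketch on the swiss roll (example data and neighbor count are my assumptions):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=500, random_state=0)

# 'ltsa' fits a tangent space per neighborhood, then aligns them globally
ltsa = LocallyLinearEmbedding(n_neighbors=12, n_components=2, method="ltsa")
X_ltsa = ltsa.fit_transform(X)
```

`method="standard"` gives plain LLE, which reconstructs each point from its neighbors instead of aligning tangent spaces.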
- measures similarity between points by converting Euclidean distances to probabilities according to the normal distribution, then tries to match these similarities in the projected space (minimizing the KL divergence) using a t-distribution
- has a cost function that includes an attractive force and a repulsive one, optimized using gradient descent
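A minimal t-SNE sketch on a subset of the digits dataset (example data and perplexity are my choices; t-SNE is sensitive to both):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)
X, y = X[:300], y[:300]   # t-SNE is O(n^2)-ish, keep the sample small

# minimizes KL(P || Q): P from Gaussian similarities in input space,
# Q from a t-distribution in the 2-D embedding
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
X_tsne = tsne.fit_transform(X)
```

Note that `TSNE` has no `transform` for new points; that is one of the UMAP advantages mentioned below.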
- much faster, gives similar or better results than t-SNE, and can transform new data (can be used to predict)
- preserves more of the global structure
- very nice for data exploration, even when having few features (fallen out of favor recently, but remains a powerful technique)
- all tunings of neural networks apply to autoencoders: topology, cost function, type of layers, type of cells, type of activation functions, number of nodes, regularization, skip connections, dropout/dropconnect, optimizers, network initialization
- use convolutions when there is some spatial relation, [maybe] recurrent networks for time-series
- 150 input features, 1000 samples
- why not always use autoencoders? sample size needed, overfitting dangers, lots of tuning needed, interpretability
- latent layer in the middle
- reconstruction results will be similar, the difference is in the sparsity of activations and, thus, robustness and interpretability
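A deliberately tiny pure-NumPy autoencoder illustrating the latent-layer-in-the-middle idea (toy data, one tanh hidden layer, and full-batch gradient descent are all simplifying assumptions; a real model would use a deep-learning framework):

```python
import numpy as np

rng = np.random.default_rng(0)
# toy data: 1000 samples living near a 2-D subspace of a 10-D space
Z = rng.normal(size=(1000, 2))
X = Z @ rng.normal(size=(2, 10)) + 0.05 * rng.normal(size=(1000, 10))

# encoder: 10 -> 2 (tanh latent layer); decoder: 2 -> 10 (linear)
W1 = rng.normal(scale=0.1, size=(10, 2)); b1 = np.zeros(2)
W2 = rng.normal(scale=0.1, size=(2, 10)); b2 = np.zeros(10)
lr = 0.05

def forward(X):
    H = np.tanh(X @ W1 + b1)       # latent code (bottleneck)
    return H, H @ W2 + b2          # reconstruction

_, X_hat0 = forward(X)
mse0 = np.mean((X - X_hat0) ** 2)  # reconstruction error before training

for _ in range(3000):              # minimize MSE by full-batch gradient descent
    H, X_hat = forward(X)
    err = X_hat - X                          # dL/dX_hat
    gW2 = H.T @ err / len(X); gb2 = err.mean(axis=0)
    dH = err @ W2.T * (1 - H ** 2)           # backprop through tanh
    gW1 = X.T @ dH / len(X); gb1 = dH.mean(axis=0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

_, X_hat = forward(X)
mse = np.mean((X - X_hat) ** 2)    # much lower after training
```

The 2-unit bottleneck forces the network to learn a compressed representation; the decoder output is the decompression.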
- theoretically forces better data separation by repelling different samples in the latent space
- increased robustness
- the sample projection is for MNIST
- only the loss function differs, it will penalize sensitivity of latent activations to inputs
- can be combined with other types (sample projection is for denoising-contractive AE)
- the latent space is special: it stores probability distributions (a mean and a standard deviation are kept for each dimension). The decoder receives samples drawn from a normal distribution in the latent layer. The "reparametrization trick" allows the autoencoder to backpropagate through the sampling step
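The reparametrization trick and the closed-form KL term of the VAE loss in a small NumPy sketch (batch size, latent width, and the random "encoder outputs" are illustrative stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
# pretend these came out of the encoder: per-dimension mean and log-variance
mu = rng.normal(size=(32, 8))
log_var = rng.normal(scale=0.1, size=(32, 8))

# reparametrization trick: z = mu + sigma * eps with eps ~ N(0, I),
# so the randomness is external and gradients can flow through mu and sigma
eps = rng.standard_normal(mu.shape)
z = mu + np.exp(0.5 * log_var) * eps

# KL( q(z|x) || N(0, I) ) per sample, in closed form
kl = -0.5 * np.sum(1 + log_var - mu ** 2 - np.exp(log_var), axis=1)
```

The decoder consumes `z`; the total VAE loss is the reconstruction term plus this KL term (and beta-VAE, below, simply multiplies the KL term by beta).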
- semi-supervised extension: cVAE
- how beta works: a simple multiplicative weight on the KL term in the VAE objective function, which seems to encourage the latent representation axes to be sensitive to changes in individual generative factors
- cVAEGAN: semi-supervised, conditional learning
- combining multiple networks, some with common layers: encoder, decoder, classifier and discriminator
- tries to find a sparse representation of the original data that best allows reconstruction, maximizing the mutual information between the data points and a noisy representation in the latent layer
- sample network is small, embedding more knowledge in it can help: deeper networks, more neurons, convolutions
- using special cells: GRU, LSTM, Nested LSTM
- long-term dependencies problem: Bi-LSTM, attention. Maybe "attention is all you need".
- Neighbourhood Components Analysis (NCA)
- Maximum Variance Unfolding (MVU)
- Generative Topographic Mapping
- Diffusion Maps
- Twin Kernel Embedding (TKE)
- Conditional Subspace VAE (CSVAE)
- Vector Quantised VAE (VQ-VAE)
- Total Correlation VAE (beta-TCVAE)
- Independent Subspace Analysis VAE (ISA-VAE)
- Factorized Action VAE (FAVAE)
- FactorVAE
- oi-VAE
- Auto-Classifier-Encoder (ACE)
- InfoGAN
- Adversarial Information Factorization (IFcVAEGAN)